Hone: "Scaling Down" Hadoop on Shared-Memory Systems
Authors
Abstract
The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale datastores are increasingly common, it is unclear whether “typical” analytics tasks require more than a single high-end server. Additionally, we are seeing increased sophistication in analytics, e.g., machine learning, which generally operates over smaller and more refined datasets. To address these trends, we propose “scaling down” Hadoop to run on shared-memory machines. This paper presents a prototype runtime called Hone, intended to be both API and binary compatible with standard (distributed) Hadoop. That is, Hone can take an existing Hadoop jar and efficiently execute it, without modification, on a multi-core shared memory machine. This allows us to take existing Hadoop algorithms and find the most suitable runtime environment for execution on datasets of varying sizes. Our experiments show that Hone can be an order of magnitude faster than Hadoop pseudo-distributed mode (PDM); on dataset sizes that fit into memory, Hone can outperform a fully-distributed 15-node Hadoop cluster in some cases as well.
Similar resources
Optimization Techniques for "Scaling Down" Hadoop on Multi-Core, Shared-Memory Systems
The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale datastores are increasingly common, it is unclear whether “typical” analytics tasks require more than a single high-end server. Additionally, we are ...
HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases
Recent improvements in both the performance and scalability of shared-nothing, transactional, in-memory NewSQL databases have reopened the research question of whether distributed metadata for hierarchical file systems can be managed using commodity databases. In this paper, we introduce HopsFS, a next generation distribution of the Hadoop Distributed File System (HDFS) that replaces HDFS’ sing...
Parallelization Strategies for Distributed Non Negative Matrix Factorization
Dimensionality reduction and clustering have been the subject of intense research efforts over the past few years [2]. They offer an approach of knowledge extraction from huge amounts of data. Although some of these techniques are effective at achieving lower data dimensions, very few focused on scaling the techniques to tackle data sets that might not fit into memory. Non negative matrix facto...
Parallel algorithms for clustering biological graphs on distributed and shared memory architectures
Graph algorithms on parallel architectures present an interesting case study for irregular applications. In this paper, we address one such irregular application — one of clustering real world graphs constructed out of biological data using parallel computers. While theoretical formulations of the clustering operation are either intractable or computationally prohibitive, efficient heuristics e...
A Distributed Phoenix++ Framework for Big Data Recommendation Systems
Recommendation systems are important big data applications that are used in many business sectors of the global economy. While many users utilize Hadoop-like MapReduce systems to implement recommendation systems, we utilize the high-performance shared-memory MapReduce system Phoenix++ to design a faster recommendation engine. In this paper, we design a distributed out-of-core recommendation algor...
Journal:
- PVLDB
Volume 6, Issue
Pages -
Publication date: 2013